The hypothetical HR department at the fictional Salifort Motors collected employee data to improve satisfaction. They requested data-driven suggestions based on an analysis of this data. The main question is: what factors are likely to make an employee leave the company?
The goal of this project is to analyze the data and build a model to predict employee attrition. By identifying which employees are likely to leave, it may be possible to determine the factors contributing to their departure. The model should be interpretable so HR can design targeted interventions to improve retention. Improving retention can reduce the costs associated with hiring and training new employees.
Stakeholders:
The primary stakeholder is the Human Resources (HR) department, as they will use the results to inform retention strategies. Secondary stakeholders include C-suite executives who oversee company direction, managers implementing day-to-day retention efforts, employees (whose experiences and outcomes are directly affected), and, indirectly, customers—since employee satisfaction can impact customer satisfaction.
Ethical Considerations:
The dataset contains 14,999 rows and 10 columns for the variables listed below.
Note: For more information about the data, refer to its source on Kaggle.
| Variable | Description |
|---|---|
| satisfaction_level | Employee-reported job satisfaction level [0–1] |
| last_evaluation | Score of employee's last performance review [0–1] |
| number_project | Number of projects employee contributes to |
| average_monthly_hours | Average number of hours employee worked per month |
| time_spend_company | How long the employee has been with the company (years) |
| Work_accident | Whether or not the employee experienced an accident while at work |
| left | Whether or not the employee left the company |
| promotion_last_5years | Whether or not the employee was promoted in the last 5 years |
| Department | The employee's department |
| salary | The employee's salary band (low, medium, or high) |
Initial Data Observations:
Note: During initial data exploration, several basic data cleaning steps were taken. Columns were renamed to standardized snake_case format for consistency and easier coding. I confirmed there were no missing values, reducing the risk of bias or errors. Outliers were explored but not removed at this stage; they will be addressed as needed during modeling.
Most importantly, there were 3,008 duplicate rows in the dataset. Since it is highly improbable for two employees to have identical responses across all columns, these duplicate entries were removed from the analysis.
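The cleaning steps described above can be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the notebook's actual code: the toy values are invented, and the exact target names chosen in the rename (for example `tenure`) are assumptions.

```python
import pandas as pd

# Toy stand-in for the raw data (values are made up; column names follow the raw file).
raw = pd.DataFrame(
    {
        "satisfaction_level": [0.38, 0.80, 0.38, 0.11],
        "average_montly_hours": [157, 262, 157, 272],
        "time_spend_company": [3, 6, 3, 4],
        "Work_accident": [0, 0, 0, 0],
        "Department": ["sales", "sales", "sales", "sales"],
        "left": [1, 1, 1, 1],
    }
)

# Standardize names to snake_case; the new names here are assumed for illustration.
df = raw.rename(
    columns={
        "average_montly_hours": "average_monthly_hours",
        "time_spend_company": "tenure",
        "Work_accident": "work_accident",
        "Department": "department",
    }
)

# Check for missing values, then count and drop rows that repeat an earlier row exactly.
n_missing = int(df.isna().sum().sum())
n_duplicates = int(df.duplicated().sum())
df = df.drop_duplicates(keep="first").reset_index(drop=True)

print(n_missing, n_duplicates, len(df))
```

Here `duplicated()` flags only the second and later copies of a row, so the first occurrence of each record is kept.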
Display the first few rows of data:
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years | Department | salary |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.38 | 0.53 | 2 | 157 | 3 | 0 | 1 | 0 | sales | low |
| 1 | 0.80 | 0.86 | 5 | 262 | 6 | 0 | 1 | 0 | sales | medium |
| 2 | 0.11 | 0.88 | 7 | 272 | 4 | 0 | 1 | 0 | sales | medium |
| 3 | 0.72 | 0.87 | 5 | 223 | 5 | 0 | 1 | 0 | sales | low |
| 4 | 0.37 | 0.52 | 2 | 159 | 3 | 0 | 1 | 0 | sales | low |
And the descriptive statistics:
| | satisfaction_level | last_evaluation | number_project | average_montly_hours | time_spend_company | Work_accident | left | promotion_last_5years |
|---|---|---|---|---|---|---|---|---|
| count | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 | 14999.000000 |
| mean | 0.612834 | 0.716102 | 3.803054 | 201.050337 | 3.498233 | 0.144610 | 0.238083 | 0.021268 |
| std | 0.248631 | 0.171169 | 1.232592 | 49.943099 | 1.460136 | 0.351719 | 0.425924 | 0.144281 |
| min | 0.090000 | 0.360000 | 2.000000 | 96.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 0.440000 | 0.560000 | 3.000000 | 156.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 50% | 0.640000 | 0.720000 | 4.000000 | 200.000000 | 3.000000 | 0.000000 | 0.000000 | 0.000000 |
| 75% | 0.820000 | 0.870000 | 5.000000 | 245.000000 | 4.000000 | 0.000000 | 0.000000 | 0.000000 |
| max | 1.000000 | 1.000000 | 7.000000 | 310.000000 | 10.000000 | 1.000000 | 1.000000 | 1.000000 |
Department value counts and percent:
| Department | Count | Percent |
|---|---|---|
| sales | 4140 | 27.60 |
| technical | 2720 | 18.13 |
| support | 2229 | 14.86 |
| IT | 1227 | 8.18 |
| product_mng | 902 | 6.01 |
| marketing | 858 | 5.72 |
| RandD | 787 | 5.25 |
| accounting | 767 | 5.11 |
| hr | 739 | 4.93 |
| management | 630 | 4.20 |
Salary value counts and percent:
| salary | Count | Percent |
|---|---|---|
| low | 7316 | 48.78 |
| medium | 6446 | 42.98 |
| high | 1237 | 8.25 |
Summary:
The data shows a workforce with moderate satisfaction, generally high performance reviews, and a typical tenure of 3–4 years. Only about 2% of employees were promoted in the last five years, and about 14% experienced a work accident. Most employees are in the low or medium salary bands and concentrated in sales, technical, and support roles. About 24% of employees in the raw data have left. There are no extreme outliers, but a few employees have unusually long tenures or high monthly hours.
Note that long-term employees—those at the company for over five years—are outliers in the data.
Note that employees with unusually high average monthly hours or an exceptionally high or low number of projects may also be considered outliers. These cases are not obvious in overall summary statistics or aggregate plots, as their effect is diluted by the much larger group of typical employees. Aggregate statistics alone can mask important subgroup dynamics. These subgroups become apparent in the per-feature breakdowns that follow.
Number of tenure outliers: 824
Outliers percentage of total: 6.87%
Some models are more sensitive to outliers than others. For the logistic regression model, tenure outliers will be removed.
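The notebook does not show how the tenure outliers were flagged. One common rule that is consistent with the "over five years" cutoff is the 1.5 × IQR fence: with the tenure quartiles from the descriptive statistics (3 and 4 years), the upper fence lands at 5.5 years. The sketch below applies that rule to made-up tenure values; treating it as the method used here is an assumption.

```python
import pandas as pd

# Made-up tenure values in years (illustrative only).
tenure = pd.Series([2, 2, 3, 3, 3, 3, 3, 4, 4, 4, 5, 6, 10], name="tenure")

# Quartiles and interquartile range.
q1 = tenure.quantile(0.25)
q3 = tenure.quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
is_outlier = (tenure < lower) | (tenure > upper)

n_outliers = int(is_outlier.sum())
pct_outliers = 100 * n_outliers / len(tenure)

# Filtered copy, e.g. for a model that is sensitive to outliers.
tenure_no_outliers = tenure[~is_outlier]

print(f"fences: [{lower}, {upper}], outliers: {n_outliers} ({pct_outliers:.2f}%)")
```

Keeping the boolean mask separate from the filtered copy makes it easy to drop outliers for one model and keep them for another.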
First, look at the distribution of employees who left versus those who stayed.
| | Count | Percent |
|---|---|---|
| Stayed | 10000 | 83.4 |
| Left | 1991 | 16.6 |
Employees working on 3–4 projects generally stayed. Most groups worked more than a typical 40-hour workweek.
Attrition is highest at the 4–5 year mark, with a sharp drop-off in departures after 5 years. This suggests a critical window for retention efforts. Employees who make it past 5 years are much more likely to stay.
Both leavers and stayers tend to have similar evaluation scores, though some employees with high evaluations still leave—often those who are overworked. This suggests that strong performance alone does not guarantee retention if other factors (like satisfaction or workload) are problematic.
Relationships Between Variables:
Distributions in the Data:
Ethical Considerations:
Note: This data is clearly synthetic. It is too clean, and the clusters in the charts are much neater than what you’d see in real-world HR data.
These are overview plots that provide a broad look at the data. After these, we’ll focus on individual features in more detail. The goal here is to give an initial sense of the dataset’s structure and key patterns.
Pairplots show the relationships between features, with the diagonal displaying each feature’s distribution.
Boxplots summarize the overall distribution of each feature. However, as noted earlier, aggregate plots can sometimes hide important subgroups or outliers.
Violin plots are especially useful here, as they reveal the presence of distinct subgroups. For example, in satisfaction_level, you can see the extremely miserable and somewhat dissatisfied employees, along with those who left for more typical reasons. In last_evaluation and average_monthly_hours, employees who left cluster at both extremes, while those who stayed are more evenly distributed. For number_project, leavers are concentrated at both the low and high ends, and for tenure, there is a noticeable spike in departures around the 4–5 year mark.
Finally, we include histograms for each feature—first normalized (to compare proportions between leavers and stayers), and then as raw counts.
Raw counts:
There are two prominent clusters among employees who left: one group with very low satisfaction who worked long hours, and another group who worked fewer than 40 hours per week and reported moderate dissatisfaction.
The pattern is similar among employees who left: those with very low satisfaction often received high evaluations, while those with moderate dissatisfaction tended to have relatively low evaluation scores.
| left | Mean | Median |
|---|---|---|
| Stayed | 0.667365 | 0.69 |
| Left | 0.440271 | 0.41 |
On the 0–1 satisfaction scale, employees who left scored 0.227 lower on average (mean) and 0.28 lower at the median than those who stayed. In relative terms, that is about 34% and 41% lower, respectively.
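The size of the satisfaction gap can be recomputed from the table above, both as an absolute difference on the 0–1 scale and as a percentage of the stayers' value. This is a small arithmetic check; the variable names are illustrative.

```python
# Mean and median satisfaction by group, copied from the table above.
stayed = {"mean": 0.667365, "median": 0.69}
left = {"mean": 0.440271, "median": 0.41}

gaps = {}
for stat in ("mean", "median"):
    absolute_gap = stayed[stat] - left[stat]          # difference on the 0-1 scale
    relative_gap = 100 * absolute_gap / stayed[stat]  # percent of the stayers' value
    gaps[stat] = (absolute_gap, relative_gap)
    print(f"{stat}: {absolute_gap:.3f} lower in absolute terms, {relative_gap:.1f}% lower in relative terms")
```

The two measures answer different questions: the absolute gap is in satisfaction-score units, while the relative gap expresses it against the stayers' baseline.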
Attrition peaks at 4 and 5 years of tenure, where 24.7% and 45.4% of each group left. It falls to 20.1% at 6 years, and no employee with 7 or more years of tenure left.
| | Tenure | Left | Count | Percent |
|---|---|---|---|---|
| 0 | 2 | Stayed | 2879 | 98.934708 |
| 1 | 2 | Left | 31 | 1.065292 |
| 2 | 3 | Stayed | 4316 | 83.159923 |
| 3 | 3 | Left | 874 | 16.840077 |
| 4 | 4 | Stayed | 1510 | 75.311721 |
| 5 | 4 | Left | 495 | 24.688279 |
| 6 | 5 | Stayed | 580 | 54.613936 |
| 7 | 5 | Left | 482 | 45.386064 |
| 8 | 6 | Stayed | 433 | 79.889299 |
| 9 | 6 | Left | 109 | 20.110701 |
| 10 | 7 | Stayed | 94 | 100.000000 |
| 11 | 7 | Left | 0 | 0.000000 |
| 12 | 8 | Stayed | 81 | 100.000000 |
| 13 | 8 | Left | 0 | 0.000000 |
| 14 | 10 | Stayed | 107 | 100.000000 |
| 15 | 10 | Left | 0 | 0.000000 |
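A table of this shape can be produced with a group-wise count followed by a within-group percentage. The sketch below uses a small made-up frame; the column names `tenure` and `left` and the 0/1 coding of `left` are assumed to match the cleaned data.

```python
import pandas as pd

# Made-up rows; 1 = left, 0 = stayed.
df = pd.DataFrame(
    {
        "tenure": [2, 2, 2, 2, 3, 3, 3, 3, 5, 5],
        "left": [0, 0, 0, 1, 0, 0, 1, 1, 1, 0],
    }
)

# Count leavers and stayers within each tenure value.
counts = (
    df.assign(left=df["left"].map({0: "Stayed", 1: "Left"}))
    .groupby(["tenure", "left"])
    .size()
    .rename("Count")
    .reset_index()
)

# Percent of each tenure group that stayed or left.
group_totals = counts.groupby("tenure")["Count"].transform("sum")
counts["Percent"] = 100 * counts["Count"] / group_totals

print(counts)
```

Unlike the table above, this version omits combinations with zero employees; reindexing over all tenure and outcome pairs would be one way to add those rows back.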
A band of employees with low satisfaction is especially evident at four years of tenure.
There is a clear grouping of leavers who consistently worked long hours (i.e., many in excess of a 60-hour work week). In fact, most employees at this company work above a standard 40-hour work week.
| Item | number_project | left | Count | Percent |
|---|---|---|---|---|
| 0 | 2 | Left | 857.0 | 54.17 |
| 1 | 2 | Stayed | 725.0 | 45.83 |
| 2 | 3 | Left | 38.0 | 1.08 |
| 3 | 3 | Stayed | 3482.0 | 98.92 |
| 4 | 4 | Left | 237.0 | 6.43 |
| 5 | 4 | Stayed | 3448.0 | 93.57 |
| 6 | 5 | Left | 343.0 | 15.36 |
| 7 | 5 | Stayed | 1890.0 | 84.64 |
| 8 | 6 | Left | 371.0 | 44.92 |
| 9 | 6 | Stayed | 455.0 | 55.08 |
| 10 | 7 | Left | 145.0 | 100.00 |
| 11 | 7 | Stayed | 0.0 | 0.00 |
The number of projects is a strong predictor of attrition. Employees at both the low and high extremes are more likely to leave, and notably, all employees with 7 projects left the company.
There are no notable outliers among employees who stayed. Among those who left, both overworked and underworked patterns are evident, as well as a group who left for more typical reasons. Interestingly, a few employees with 7 projects reported unusually low monthly hours, which may indicate data anomalies or unique circumstances.
Among employees who left, dissatisfaction is most evident for those assigned a very high number of projects. Conversely, those with fewer projects also show signs of lower satisfaction, possibly indicating disengagement.
There are no clear patterns linking number of projects, salary, and attrition. However, the relatively small group of high-salaried employees tends to fall in the middle range for number of projects.
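Attrition falls as salary band rises: 20.5% of low-salary employees left, compared with 14.6% of medium-salary and 4.9% of high-salary employees. The 'high' salary group is roughly five to six times smaller than the other two (990 employees versus 5,740 and 5,261), which limits its impact on overall trends.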
| Item | salary | Left | Count | Percent |
|---|---|---|---|---|
| 0 | high | Left | 48.0 | 4.85 |
| 1 | high | Stayed | 942.0 | 95.15 |
| 2 | low | Left | 1174.0 | 20.45 |
| 3 | low | Stayed | 4566.0 | 79.55 |
| 4 | medium | Left | 769.0 | 14.62 |
| 5 | medium | Stayed | 4492.0 | 85.38 |
Promotions were rare and, notably, all of the employees with the highest workload left.
| Item | work_accident | left | Count | Percent |
|---|---|---|---|---|
| 0 | No | Left | 1886.0 | 18.60 |
| 1 | No | Stayed | 8255.0 | 81.40 |
| 2 | Yes | Left | 105.0 | 5.68 |
| 3 | Yes | Stayed | 1745.0 | 94.32 |
Somewhat unexpectedly, having a work accident is associated with a lower likelihood of leaving. This could suggest that employees who experience an accident may receive increased support or attention from HR or the company, which encourages them to stay. However, this association could also be coincidental.
Department-level attrition stays close to the overall stay/leave split (83%/17%), ranging from 11.9% in management to 18.8% in hr, suggesting department itself is not a major factor. More granular data (e.g., by manager or team) might uncover specific problem areas, but nothing stands out in the current breakdown.
| Item | department | Left | Count | Percent |
|---|---|---|---|---|
| 0 | IT | Left | 158.0 | 16.19 |
| 1 | IT | Stayed | 818.0 | 83.81 |
| 2 | RandD | Left | 85.0 | 12.25 |
| 3 | RandD | Stayed | 609.0 | 87.75 |
| 4 | accounting | Left | 109.0 | 17.55 |
| 5 | accounting | Stayed | 512.0 | 82.45 |
| 6 | hr | Left | 113.0 | 18.80 |
| 7 | hr | Stayed | 488.0 | 81.20 |
| 8 | management | Left | 52.0 | 11.93 |
| 9 | management | Stayed | 384.0 | 88.07 |
| 10 | marketing | Left | 112.0 | 16.64 |
| 11 | marketing | Stayed | 561.0 | 83.36 |
| 12 | product_mng | Left | 110.0 | 16.03 |
| 13 | product_mng | Stayed | 576.0 | 83.97 |
| 14 | sales | Left | 550.0 | 16.98 |
| 15 | sales | Stayed | 2689.0 | 83.02 |
| 16 | support | Left | 312.0 | 17.13 |
| 17 | support | Stayed | 1509.0 | 82.87 |
| 18 | technical | Left | 390.0 | 17.38 |
| 19 | technical | Stayed | 1854.0 | 82.62 |
The correlation matrix shows no strong multicollinearity among the features, meaning the variables are not highly redundant. Employee attrition (leaving) is most strongly and negatively correlated with satisfaction level, indicating that less satisfied employees are more likely to leave. There are moderate positive correlations between leaving and variables such as average monthly hours, last evaluation, and number of projects, suggesting that higher values in these features are associated with a greater likelihood of attrition.
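A correlation matrix like the one described can be computed directly with pandas. The sketch below uses made-up numbers, so the coefficients it prints say nothing about the real data; it only shows the mechanics, and the column names are assumed to match the cleaned frame.

```python
import pandas as pd

# Made-up values for two features and the target (1 = left).
df = pd.DataFrame(
    {
        "satisfaction_level": [0.9, 0.8, 0.7, 0.4, 0.2, 0.1],
        "average_monthly_hours": [160, 170, 180, 150, 270, 290],
        "left": [0, 0, 0, 1, 1, 1],
    }
)

# Pearson correlation between every pair of columns.
corr = df.corr()

# Correlation of each feature with the target, strongest first by absolute value.
with_left = corr["left"].drop("left").sort_values(key=abs, ascending=False)

print(corr.round(2))
print(with_left.round(2))
```

Pearson correlation only captures linear association, so a U-shaped relationship, such as leavers clustering at both low and high values of a feature, can produce a coefficient near zero.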
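The data suggests significant challenges with employee retention at this company. Two main groups of leavers emerge: overworked employees with very low satisfaction, who worked long hours on many projects and often received high evaluations, and underworked employees with moderate dissatisfaction, who worked fewer than 40 hours per week on few projects and tended to receive relatively low evaluations.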
A majority of the workforce greatly exceeds the typical 40-hour work week (160–184 hours per month), pointing to a workplace culture that expects long hours. The combination of high workload and limited opportunities for advancement likely fuels dissatisfaction and increases the risk of turnover.
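The monthly range quoted for a 40-hour week depends on how many working days a month has. A rough check follows, assuming 8-hour days, 20 to 23 working days per month, and 52 weeks spread over 12 months; these conventions are assumptions, not something the notebook states.

```python
HOURS_PER_DAY = 8
HOURS_PER_WEEK = 40

# Monthly hours for a standard schedule under the assumed working-day range.
low = HOURS_PER_DAY * 20                 # shortest assumed month
high = HOURS_PER_DAY * 23                # longest assumed month
average = HOURS_PER_WEEK * 52 / 12       # long-run monthly average

# Convert the median monthly hours from the descriptive statistics to a weekly figure.
median_monthly_hours = 200.0
weekly_equivalent = median_monthly_hours * 12 / 52

print(f"standard month: {low}-{high} hours (average {average:.1f})")
print(f"median employee: about {weekly_equivalent:.1f} hours per week")
```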
Performance evaluations show only a weak link to attrition; both those who left and those who stayed received similar review scores. This indicates that strong performance alone does not guarantee retention, especially if employees are overworked or lack opportunities for growth.
Other variables—such as department, salary, and work accidents—do not show strong predictive value for employee churn compared to satisfaction and workload. Overall, the data points to issues with workload management and limited career progression as the main factors driving employee turnover at this company.